Before importing the data into R, we preprocessed the raw files in Python to obtain a tidier data frame, since several columns are stored as JSON strings with many nested attributes.
## 'data.frame': 4803 obs. of 12 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ budget : int 237000000 300000000 245000000 250000000 260000000 258000000 260000000 280000000 250000000 250000000 ...
## $ genres : Factor w/ 21 levels "","Action","Adventure",..: 2 3 2 2 2 10 4 2 3 2 ...
## $ popularity : num 150.4 139.1 107.4 112.3 43.9 ...
## $ production_companies: Factor w/ 1314 levels "","100 Bares",..: 615 1263 265 696 1263 265 1263 758 1267 320 ...
## $ release_date : Factor w/ 3281 levels "","1916-09-04",..: 2315 1945 3185 2688 2635 1940 2450 3111 2246 3234 ...
## $ revenue : num 2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
## $ runtime : num 162 169 148 165 132 139 100 141 153 151 ...
## $ title : Factor w/ 4800 levels "(500) Days of Summer",..: 381 2653 3186 3614 1906 3198 3364 382 1587 444 ...
## $ vote_average : num 7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
## $ vote_count : int 11800 4500 4466 9106 2124 3576 3330 6767 5293 7004 ...
## $ Number_Genres : int 4 3 3 4 3 3 2 3 3 3 ...
## 'data.frame': 3225 obs. of 15 variables:
## $ budget : int 237000000 300000000 245000000 250000000 260000000 258000000 260000000 280000000 250000000 250000000 ...
## $ genres : Factor w/ 18 levels "Action","Adventure",..: 1 2 1 1 1 9 3 1 2 1 ...
## $ popularity: num 150.4 139.1 107.4 112.3 43.9 ...
## $ company : Factor w/ 6 levels "Others","Paramount Pictures",..: 1 5 3 1 5 3 5 5 6 6 ...
## $ date : Date, format: "2009-12-10" "2007-05-19" ...
## $ revenue : num 2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
## $ runtime : num 162 169 148 165 132 139 100 141 153 151 ...
## $ title : Factor w/ 3224 levels "(500) Days of Summer",..: 259 1761 2129 2420 1265 2139 2256 260 1053 310 ...
## $ score : num 7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
## $ vote : int 11800 4500 4466 9106 2124 3576 3330 6767 5293 7004 ...
## $ profit : num 2.55e+09 6.61e+08 6.36e+08 8.35e+08 2.41e+07 ...
## $ profitable: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ season : Factor w/ 4 levels "Spring","Summer",..: 4 1 3 2 1 1 3 1 2 1 ...
## $ quarter : Factor w/ 4 levels "Q1","Q2","Q3",..: 4 2 4 3 1 2 4 2 3 1 ...
## $ year : num 2009 2007 2015 2012 2012 ...
## budget genres popularity
## Min. :1.00e+00 Drama :745 Min. : 0
## 1st Qu.:1.05e+07 Comedy :634 1st Qu.: 10
## Median :2.50e+07 Action :588 Median : 20
## Mean :4.07e+07 Adventure:288 Mean : 29
## 3rd Qu.:5.50e+07 Horror :197 3rd Qu.: 37
## Max. :3.80e+08 Crime :141 Max. :876
## (Other) :632
## company date revenue
## Others :1636 Min. :1916-09-04 Min. :5.00e+00
## Paramount Pictures: 255 1st Qu.:1998-09-10 1st Qu.:1.71e+07
## Sony Pictures : 277 Median :2005-07-20 Median :5.52e+07
## Universal Pictures: 338 Mean :2002-03-18 Mean :1.21e+08
## Walt Disney : 497 3rd Qu.:2010-11-11 3rd Qu.:1.46e+08
## Warner Bros : 222 Max. :2016-09-09 Max. :2.79e+09
##
## runtime title score
## Min. : 41 The Host : 2 Min. :2.30
## 1st Qu.: 96 (500) Days of Summer : 1 1st Qu.:5.80
## Median :107 [REC] : 1 Median :6.30
## Mean :111 [REC]² : 1 Mean :6.31
## 3rd Qu.:121 10 Cloverfield Lane : 1 3rd Qu.:6.90
## Max. :338 10 Things I Hate About You: 1 Max. :8.50
## (Other) :3218
## vote profit profitable season quarter
## Min. : 1 Min. :-1.66e+08 0: 787 Spring:704 Q1:656
## 1st Qu.: 179 1st Qu.: 2.52e+05 1:2438 Summer:837 Q2:757
## Median : 471 Median : 2.64e+07 Fall :930 Q3:931
## Mean : 978 Mean : 8.07e+07 Winter:754 Q4:881
## 3rd Qu.: 1148 3rd Qu.: 9.75e+07
## Max. :13752 Max. : 2.55e+09
##
## year
## Min. :1916
## 1st Qu.:1998
## Median :2005
## Mean :2002
## 3rd Qu.:2010
## Max. :2016
##
## revenue budget popularity runtime
## Min. :5.00e+00 Min. :1.00e+00 Min. : 0 Min. : 41
## 1st Qu.:1.71e+07 1st Qu.:1.05e+07 1st Qu.: 10 1st Qu.: 96
## Median :5.52e+07 Median :2.50e+07 Median : 20 Median :107
## Mean :1.21e+08 Mean :4.07e+07 Mean : 29 Mean :111
## 3rd Qu.:1.46e+08 3rd Qu.:5.50e+07 3rd Qu.: 37 3rd Qu.:121
## Max. :2.79e+09 Max. :3.80e+08 Max. :876 Max. :338
## score vote profit
## Min. :2.30 Min. : 1 Min. :-1.66e+08
## 1st Qu.:5.80 1st Qu.: 179 1st Qu.: 2.52e+05
## Median :6.30 Median : 471 Median : 2.64e+07
## Mean :6.31 Mean : 978 Mean : 8.07e+07
## 3rd Qu.:6.90 3rd Qu.: 1148 3rd Qu.: 9.75e+07
## Max. :8.50 Max. :13752 Max. : 2.55e+09
Variance and SD
## revenue budget popularity runtime score vote
## 1.86e+08 4.44e+07 3.62e+01 2.10e+01 8.60e-01 1.41e+03
## profit
## 1.58e+08
## revenue budget popularity runtime score vote
## 3.47e+16 1.97e+15 1.31e+03 4.40e+02 7.39e-01 2.00e+06
## profit
## 2.50e+16
The means, variances, and standard deviations differ widely across variables because most of them are measured on different scales. We need to scale the data for models such as linear regression, PCR, and KNN.
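The scaling itself is a one-liner with base R's `scale()`. A minimal sketch on synthetic data (the column names and value ranges are illustrative, not our actual data):

```r
# Sketch: standardize numeric predictors so each has mean 0 and sd 1.
set.seed(1)
movies <- data.frame(
  budget     = runif(100, 1e6, 3e8),  # dollars: a huge scale
  popularity = runif(100, 0, 900),    # unitless index: a small scale
  score      = runif(100, 2, 9)       # 0-10 rating
)
movies_scaled <- as.data.frame(scale(movies))
colMeans(movies_scaled)          # all ~0 (within floating-point error)
apply(movies_scaled, 2, sd)      # all exactly 1
```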
Number of movies by company
Number of movies by season
Test the frequency distributions of revenue across genres
Overall, there is evidence that the frequency distributions of revenue differ across genres. Revenue appears to depend on genre.
Check the frequency distributions of revenue across companies
Overall, there is evidence that the frequency distributions of revenue differ across companies. Revenue appears to depend on company.
Winter and fall appear to form one group, while spring and summer form another.
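The specific test used for these group comparisons is not shown in the output. One standard option for comparing a numeric response across groups is the Kruskal-Wallis test; a sketch on synthetic data (variable names and the choice of test are illustrative assumptions):

```r
# Sketch (assumed approach): compare revenue distributions across genres
# with a Kruskal-Wallis rank-sum test on simulated, clearly shifted groups.
set.seed(42)
dat <- data.frame(
  revenue = c(rlnorm(50, meanlog = 16, sdlog = 1),
              rlnorm(50, meanlog = 19, sdlog = 1)),
  genre   = rep(c("Drama", "Action"), each = 50)
)
kt <- kruskal.test(revenue ~ genre, data = dat)
kt$p.value   # very small here: the two distributions differ
```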
Construct the model on the training set (using all numerical variables)
##
## Call:
## lm(formula = revenue ~ ., data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.21e+08 -3.89e+07 -1.92e+06 2.46e+07 1.60e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 122161921 2168480 56.34 < 2e-16 ***
## budget 82502895 2822244 29.23 < 2e-16 ***
## popularity 14588304 2952684 4.94 8.4e-07 ***
## runtime -1265467 2415512 -0.52 0.60
## score 212212 2648862 0.08 0.94
## vote 85807055 3723449 23.05 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.01e+08 on 2170 degrees of freedom
## Multiple R-squared: 0.711, Adjusted R-squared: 0.711
## F-statistic: 1.07e+03 on 5 and 2170 DF, p-value: <2e-16
## budget popularity runtime score vote
## 1.68 2.15 1.26 1.50 2.95
Prediction
Testing
## mae rmse
## 6.25e+07 1.06e+08
Training
## mae rmse
## 5.86e+07 1.01e+08
All three feature-selection methods show that the predictors budget, popularity, and vote form the best model.
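The three selection methods themselves are not shown here. A minimal sketch of one of them, stepwise selection by AIC with base R's `step()`, on synthetic data that stands in for our training set (names and coefficients are illustrative):

```r
# Sketch: stepwise AIC selection on simulated data where only
# budget, popularity, and vote carry true signal.
set.seed(7)
n <- 500
train1 <- data.frame(budget = rnorm(n), popularity = rnorm(n),
                     runtime = rnorm(n), score = rnorm(n), vote = rnorm(n))
train1$revenue <- with(train1, 3 * budget + 1.5 * popularity + 4 * vote + rnorm(n))
full <- lm(revenue ~ ., data = train1)
best <- step(full, direction = "both", trace = 0)
formula(best)   # budget, popularity, and vote are retained
```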
Construct the model on the training set
##
## Call:
## lm(formula = revenue ~ budget + popularity + vote, data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.21e+08 -3.85e+07 -2.19e+06 2.44e+07 1.60e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.22e+08 2.17e+06 56.36 < 2e-16 ***
## budget 8.23e+07 2.62e+06 31.45 < 2e-16 ***
## popularity 1.46e+07 2.95e+06 4.96 7.8e-07 ***
## vote 8.57e+07 3.47e+06 24.72 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.01e+08 on 2172 degrees of freedom
## Multiple R-squared: 0.711, Adjusted R-squared: 0.711
## F-statistic: 1.78e+03 on 3 and 2172 DF, p-value: <2e-16
## budget popularity vote
## 1.45 2.15 2.55
Prediction
## mae rmse
## 6.25e+07 1.06e+08
## mae rmse
## 5.86e+07 1.01e+08
There is no change compared with the model containing all numerical variables: we can remove the unnecessary variables without reducing the adjusted R-squared or increasing the RMSE. The best model is less complex and less prone to overfitting from high dimensionality.
##
## Call:
## lm(formula = revenue ~ ., data = train1_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.08e+08 -4.03e+07 -1.43e+06 2.90e+07 1.62e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 108110505 6817790 15.86 < 2e-16 ***
## budget 77348867 3033574 25.50 < 2e-16 ***
## popularity 13617667 2932282 4.64 3.6e-06 ***
## runtime 5900196 2594312 2.27 0.02305 *
## score -1602884 2751829 -0.58 0.56030
## vote 87028064 3716646 23.42 < 2e-16 ***
## genresAdventure 14998048 8669568 1.73 0.08378 .
## genresAnimation 87790816 13236699 6.63 4.2e-11 ***
## genresComedy 25268186 7163941 3.53 0.00043 ***
## genresCrime -10102172 11467478 -0.88 0.37845
## genresDocumentary 44711638 23188936 1.93 0.05397 .
## genresDrama 4923647 7294556 0.67 0.49976
## genresFamily 82467003 20391297 4.04 5.4e-05 ***
## genresFantasy 2870822 13660650 0.21 0.83357
## genresHistory 15689738 24998838 0.63 0.53032
## genresHorror 17619142 10531825 1.67 0.09448 .
## genresMusic 23079584 29354497 0.79 0.43182
## genresMystery 6103862 24024211 0.25 0.79946
## genresRomance 17057626 14926772 1.14 0.25327
## genresScience Fiction -15440346 15106799 -1.02 0.30686
## genresThriller -7679779 12337527 -0.62 0.53370
## genresWar -56563625 33719837 -1.68 0.09360 .
## genresWestern -3456414 24929422 -0.14 0.88974
## companyParamount Pictures 19120256 8303147 2.30 0.02139 *
## companySony Pictures 4555821 7970621 0.57 0.56767
## companyUniversal Pictures 16743343 7455117 2.25 0.02481 *
## companyWalt Disney 16706288 6373478 2.62 0.00882 **
## companyWarner Bros -535044 8559594 -0.06 0.95016
## seasonSummer -822742 6192423 -0.13 0.89431
## seasonFall -9929435 6117507 -1.62 0.10471
## seasonWinter -5891796 6282550 -0.94 0.34845
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 99300000 on 2145 degrees of freedom
## Multiple R-squared: 0.725, Adjusted R-squared: 0.721
## F-statistic: 188 on 30 and 2145 DF, p-value: <2e-16
## budget popularity
## 2.01 2.20
## runtime score
## 1.51 1.68
## vote genresAdventure
## 3.04 1.40
## genresAnimation genresComedy
## 1.25 1.79
## genresCrime genresDocumentary
## 1.26 1.08
## genresDrama genresFamily
## 2.06 1.08
## genresFantasy genresHistory
## 1.14 1.07
## genresHorror genresMusic
## 1.32 1.04
## genresMystery genresRomance
## 1.04 1.13
## genresScience Fiction genresThriller
## 1.11 1.18
## genresWar genresWestern
## 1.03 1.06
## companyParamount Pictures companySony Pictures
## 1.09 1.11
## companyUniversal Pictures companyWalt Disney
## 1.12 1.18
## companyWarner Bros seasonSummer
## 1.08 1.61
## seasonFall seasonWinter
## 1.65 1.60
The p-values and t-values indicate that the season effects are not significant; season does not seem to be a necessary predictor.
When including season, genre, and company in the model, the best numerical predictors are still budget, popularity, and vote. We will therefore build the model with these three numerical predictors and two categorical variables: genre and company.
##
## Call:
## lm(formula = revenue ~ budget + vote + company + genres + popularity,
## data = train1_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.14e+08 -4.04e+07 -8.25e+05 2.96e+07 1.62e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103783104 5493099 18.89 < 2e-16 ***
## budget 79307464 2810718 28.22 < 2e-16 ***
## vote 87343750 3432074 25.45 < 2e-16 ***
## companyParamount Pictures 20187352 8296723 2.43 0.01505 *
## companySony Pictures 4706890 7968099 0.59 0.55477
## companyUniversal Pictures 17875529 7436574 2.40 0.01631 *
## companyWalt Disney 16364447 6369330 2.57 0.01026 *
## companyWarner Bros -135185 8558273 -0.02 0.98740
## genresAdventure 14924242 8645045 1.73 0.08443 .
## genresAnimation 80203108 12811282 6.26 4.6e-10 ***
## genresComedy 24197175 7152449 3.38 0.00073 ***
## genresCrime -8821228 11302914 -0.78 0.43522
## genresDocumentary 40500854 22990580 1.76 0.07827 .
## genresDrama 6560315 6935298 0.95 0.34429
## genresFamily 78112299 20296238 3.85 0.00012 ***
## genresFantasy 1516100 13618042 0.11 0.91136
## genresHistory 22335236 24701813 0.90 0.36599
## genresHorror 15897705 10491528 1.52 0.12985
## genresMusic 22199706 29264598 0.76 0.44818
## genresMystery 3544411 24017226 0.15 0.88269
## genresRomance 16669177 14880624 1.12 0.26276
## genresScience Fiction -15795028 15101070 -1.05 0.29570
## genresThriller -7928898 12337416 -0.64 0.52051
## genresWar -55041648 33646312 -1.64 0.10201
## genresWestern 726641 24731085 0.03 0.97656
## popularity 13447493 2930278 4.59 4.7e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 99400000 on 2150 degrees of freedom
## Multiple R-squared: 0.724, Adjusted R-squared: 0.721
## F-statistic: 225 on 25 and 2150 DF, p-value: <2e-16
## budget vote
## 1.73 2.59
## companyParamount Pictures companySony Pictures
## 1.09 1.11
## companyUniversal Pictures companyWalt Disney
## 1.11 1.17
## companyWarner Bros genresAdventure
## 1.08 1.39
## genresAnimation genresComedy
## 1.17 1.78
## genresCrime genresDocumentary
## 1.22 1.06
## genresDrama genresFamily
## 1.86 1.07
## genresFantasy genresHistory
## 1.13 1.04
## genresHorror genresMusic
## 1.30 1.03
## genresMystery genresRomance
## 1.04 1.12
## genresScience Fiction genresThriller
## 1.11 1.17
## genresWar genresWestern
## 1.03 1.04
## popularity
## 2.19
The adjusted R-squared increases by 1.0% compared with the best model using only numerical variables.
Testing
## mae rmse
## 6.19e+07 1.05e+08
Training
## mae rmse
## 58333154 98791039
This model is a slight improvement: the adjusted R-squared increases by 1% and the RMSE decreases slightly on both the training and testing sets.
Model 1: budget + vote + popularity
Model 2: budget + vote + popularity + company + genres
Model 1 has a higher AIC than Model 2, indicating that Model 2 is better for predicting revenue. Model 1 has a lower BIC than Model 2, indicating that Model 1 is preferable as an explanatory model, since BIC favors simpler models.
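In base R this comparison is direct with `AIC()` and `BIC()`. A sketch on synthetic data (the formulas are illustrative stand-ins for Model 1 and Model 2):

```r
# Sketch: comparing a smaller and a larger nested lm fit by AIC and BIC.
set.seed(3)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 * d$x1 + 0.2 * d$x2 + rnorm(n)
m_small <- lm(y ~ x1, data = d)             # analogous to Model 1
m_big   <- lm(y ~ x1 + x2 + x3, data = d)   # analogous to Model 2
AIC(m_small, m_big)  # lower AIC favors predictive fit
BIC(m_small, m_big)  # BIC adds a heavier penalty per extra parameter
```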
A decision tree can handle both numerical and categorical variables in the model. Since we are predicting revenue, a continuous response, we use a regression tree.
We can try two functions, tree() and rpart(), to build a regression tree model.
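A minimal sketch of the rpart() route on synthetic data (tree() works analogously; variable names here are illustrative):

```r
# Sketch: fit a regression tree with rpart and inspect its complexity table.
library(rpart)
set.seed(5)
n <- 400
df <- data.frame(vote = rnorm(n), budget = rnorm(n),
                 genres = factor(sample(c("Action", "Drama"), n, replace = TRUE)))
df$revenue <- with(df, 5 * vote + 3 * budget + rnorm(n))
fit <- rpart(revenue ~ ., data = df, method = "anova")
printcp(fit)   # CP table with rel error, xerror, xstd, as in the output above
```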
##
## Regression tree:
## tree(formula = revenue ~ ., data = train1_full)
## Variables actually used in tree construction:
## [1] "vote" "budget" "genres"
## Number of terminal nodes: 10
## Residual mean deviance: 9.85e+15 = 2.13e+19 / 2170
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.75e+08 -3.76e+07 -1.25e+07 0.00e+00 2.52e+07 1.24e+09
## Call:
## rpart(formula = revenue ~ ., data = train1_full, method = "anova")
## n= 2176
##
## CP nsplit rel error xerror xstd
## 1 0.3861 0 1.000 1.001 0.1190
## 2 0.1250 1 0.614 0.709 0.0853
## 3 0.0704 2 0.489 0.535 0.0660
## 4 0.0528 3 0.418 0.483 0.0641
## 5 0.0265 4 0.366 0.455 0.0642
## 6 0.0256 5 0.339 0.452 0.0643
## 7 0.0139 6 0.313 0.412 0.0525
## 8 0.0107 7 0.300 0.383 0.0515
## 9 0.0100 8 0.289 0.385 0.0516
##
## Variable importance
## vote popularity budget score genres season
## 42 22 20 7 6 1
## runtime company
## 1 1
##
## Node number 1: 2176 observations, complexity param=0.386
## mean=1.2e+08, MSE=3.53e+16
## left son=2 (1947 obs) right son=3 (229 obs)
## Primary splits:
## vote < 0.949 to the left, improve=0.3860, (0 missing)
## budget < 1.17 to the left, improve=0.3840, (0 missing)
## popularity < 1.27 to the left, improve=0.3370, (0 missing)
## genres splits as RRRLLLLRRLLLLLRLLL, improve=0.0995, (0 missing)
## runtime < 0.609 to the left, improve=0.0587, (0 missing)
## Surrogate splits:
## popularity < 0.833 to the left, agree=0.948, adj=0.502, (0 split)
## budget < 2.45 to the left, agree=0.911, adj=0.153, (0 split)
## score < 1.79 to the left, agree=0.902, adj=0.066, (0 split)
##
## Node number 2: 1947 observations, complexity param=0.0704
## mean=8e+07, MSE=9.5e+15
## left son=4 (1481 obs) right son=5 (466 obs)
## Primary splits:
## vote < -0.099 to the left, improve=0.2930, (0 missing)
## popularity < -0.00948 to the left, improve=0.2540, (0 missing)
## budget < 0.716 to the left, improve=0.2480, (0 missing)
## company splits as LRRRRR, improve=0.0571, (0 missing)
## genres splits as LRRLLLLRRLLLLLLLLL, improve=0.0563, (0 missing)
## Surrogate splits:
## popularity < 0.0996 to the left, agree=0.922, adj=0.674, (0 split)
## budget < 1.09 to the left, agree=0.791, adj=0.129, (0 split)
## score < 2.02 to the left, agree=0.762, adj=0.004, (0 split)
## runtime < 4.4 to the left, agree=0.761, adj=0.002, (0 split)
##
## Node number 3: 229 observations, complexity param=0.125
## mean=4.61e+08, MSE=1.25e+17
## left son=6 (101 obs) right son=7 (128 obs)
## Primary splits:
## budget < 0.739 to the left, improve=0.335, (0 missing)
## vote < 2.5 to the left, improve=0.269, (0 missing)
## genres splits as RRRLL-LRRLL-LLRLLR, improve=0.210, (0 missing)
## popularity < 1.8 to the left, improve=0.176, (0 missing)
## runtime < 1.18 to the left, improve=0.093, (0 missing)
## Surrogate splits:
## genres splits as RRRLL-LRRLL-LLRLLR, agree=0.790, adj=0.525, (0 split)
## score < 1.44 to the right, agree=0.716, adj=0.356, (0 split)
## vote < 1.97 to the left, agree=0.633, adj=0.168, (0 split)
## popularity < 1.15 to the left, agree=0.607, adj=0.109, (0 split)
## season splits as RRLR, agree=0.607, adj=0.109, (0 split)
##
## Node number 4: 1481 observations, complexity param=0.0139
## mean=5.04e+07, MSE=3.52e+15
## left son=8 (772 obs) right son=9 (709 obs)
## Primary splits:
## vote < -0.497 to the left, improve=0.2040, (0 missing)
## budget < -0.32 to the left, improve=0.1950, (0 missing)
## popularity < -0.38 to the left, improve=0.1820, (0 missing)
## company splits as LRRRRR, improve=0.0553, (0 missing)
## runtime < 0.18 to the left, improve=0.0165, (0 missing)
## Surrogate splits:
## popularity < -0.389 to the left, agree=0.916, adj=0.825, (0 split)
## budget < -0.32 to the left, agree=0.630, adj=0.227, (0 split)
## genres splits as RLRLRLLLRLRLRLRRRL, agree=0.568, adj=0.097, (0 split)
## company splits as LRLRLL, agree=0.565, adj=0.092, (0 split)
## score < -0.423 to the left, agree=0.554, adj=0.068, (0 split)
##
## Node number 5: 466 observations, complexity param=0.0256
## mean=1.74e+08, MSE=1.69e+16
## left son=10 (322 obs) right son=11 (144 obs)
## Primary splits:
## budget < 0.649 to the left, improve=0.2500, (0 missing)
## genres splits as LLRLL-LRLRLLLLLLLL, improve=0.1220, (0 missing)
## company splits as LRRRRR, improve=0.0808, (0 missing)
## score < 0.973 to the right, improve=0.0708, (0 missing)
## vote < 0.432 to the left, improve=0.0478, (0 missing)
## Surrogate splits:
## genres splits as LRRLL-LRLRLLLLLLLL, agree=0.755, adj=0.208, (0 split)
## score < -0.423 to the right, agree=0.710, adj=0.063, (0 split)
## popularity < 1.54 to the left, agree=0.695, adj=0.014, (0 split)
## vote < 0.909 to the left, agree=0.695, adj=0.014, (0 split)
##
## Node number 6: 101 observations, complexity param=0.0107
## mean=2.3e+08, MSE=3.14e+16
## left son=12 (85 obs) right son=13 (16 obs)
## Primary splits:
## genres splits as LRRLL-L-LLL-RLLLL-, improve=0.2590, (0 missing)
## vote < 2.55 to the left, improve=0.2120, (0 missing)
## budget < -0.192 to the left, improve=0.1350, (0 missing)
## popularity < 1.81 to the left, improve=0.0997, (0 missing)
## season splits as RRLR, improve=0.0770, (0 missing)
## Surrogate splits:
## runtime < -0.941 to the right, agree=0.851, adj=0.063, (0 split)
## score < -0.365 to the right, agree=0.851, adj=0.063, (0 split)
##
## Node number 7: 128 observations, complexity param=0.0528
## mean=6.43e+08, MSE=1.24e+17
## left son=14 (71 obs) right son=15 (57 obs)
## Primary splits:
## vote < 2.44 to the left, improve=0.2550, (0 missing)
## budget < 3.78 to the left, improve=0.2390, (0 missing)
## popularity < 3.14 to the left, improve=0.2330, (0 missing)
## runtime < 1.13 to the left, improve=0.0954, (0 missing)
## score < -0.772 to the left, improve=0.0715, (0 missing)
## Surrogate splits:
## score < 0.624 to the left, agree=0.719, adj=0.368, (0 split)
## popularity < 1.67 to the left, agree=0.711, adj=0.351, (0 split)
## runtime < 1.13 to the left, agree=0.656, adj=0.228, (0 split)
## budget < 2.57 to the left, agree=0.641, adj=0.193, (0 split)
## company splits as LLLLRR, agree=0.633, adj=0.175, (0 split)
##
## Node number 8: 772 observations
## mean=2.47e+07, MSE=7.93e+14
##
## Node number 9: 709 observations
## mean=7.84e+07, MSE=4.98e+15
##
## Node number 10: 322 observations
## mean=1.3e+08, MSE=9.82e+15
##
## Node number 11: 144 observations
## mean=2.71e+08, MSE=1.91e+16
##
## Node number 12: 85 observations
## mean=1.91e+08, MSE=1.99e+16
##
## Node number 13: 16 observations
## mean=4.38e+08, MSE=4.15e+16
##
## Node number 14: 71 observations
## mean=4.83e+08, MSE=4.95e+16
##
## Node number 15: 57 observations, complexity param=0.0265
## mean=8.41e+08, MSE=1.46e+17
## left son=30 (47 obs) right son=31 (10 obs)
## Primary splits:
## budget < 3.98 to the left, improve=0.2450, (0 missing)
## popularity < 2.88 to the left, improve=0.2080, (0 missing)
## vote < 4.63 to the left, improve=0.1260, (0 missing)
## score < 1.32 to the right, improve=0.0656, (0 missing)
## genres splits as RRR-L-LRL-----R--L, improve=0.0449, (0 missing)
## Surrogate splits:
## vote < 7.31 to the left, agree=0.842, adj=0.1, (0 split)
##
## Node number 30: 47 observations
## mean=7.54e+08, MSE=6.86e+16
##
## Node number 31: 10 observations
## mean=1.25e+09, MSE=3.06e+17
##
## Regression tree:
## rpart(formula = revenue ~ ., data = train1_full, method = "anova")
##
## Variables actually used in tree construction:
## [1] budget genres vote
##
## Root node error: 8e+19/2176 = 4e+16
##
## n= 2176
##
## CP nsplit rel error xerror xstd
## 1 0.39 0 1.0 1.0 0.12
## 2 0.13 1 0.6 0.7 0.09
## 3 0.07 2 0.5 0.5 0.07
## 4 0.05 3 0.4 0.5 0.06
## 5 0.03 4 0.4 0.5 0.06
## 6 0.03 5 0.3 0.5 0.06
## 7 0.01 6 0.3 0.4 0.05
## 8 0.01 7 0.3 0.4 0.05
## 9 0.01 8 0.3 0.4 0.05
The two methods give the same tree; rpart() provides access to nicer plots.
Testing
## mae rmse
## 6.54e+07 1.13e+08
Training
## mae rmse
## 5.93e+07 1.01e+08
The results seem to be worse than those of the linear models.
We can see the error for each CP value.
##
## Regression tree:
## rpart(formula = revenue ~ ., data = train1_full, method = "anova")
##
## Variables actually used in tree construction:
## [1] budget genres vote
##
## Root node error: 8e+19/2176 = 4e+16
##
## n= 2176
##
## CP nsplit rel error xerror xstd
## 1 0.39 0 1.0 1.0 0.12
## 2 0.13 1 0.6 0.7 0.09
## 3 0.07 2 0.5 0.5 0.07
## 4 0.05 3 0.4 0.5 0.06
## 5 0.03 4 0.4 0.5 0.06
## 6 0.03 5 0.3 0.5 0.06
## 7 0.01 6 0.3 0.4 0.05
## 8 0.01 7 0.3 0.4 0.05
## 9 0.01 8 0.3 0.4 0.05
## 1 2 3 4 5 6 7 8 9
## 1.001 0.709 0.535 0.483 0.455 0.452 0.412 0.383 0.385
The cross-validated error (xerror) is lowest at the eighth CP value.
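The standard idiom for pruning at the CP with the lowest cross-validated error is `prune()`; a sketch on synthetic data (names illustrative):

```r
# Sketch: grow a tree, pick the CP minimizing xerror, and prune to it.
library(rpart)
set.seed(9)
n <- 400
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- ifelse(df$x1 > 0, 5, 0) + ifelse(df$x2 > 0, 3, 0) + rnorm(n)
fit <- rpart(y ~ ., data = df, method = "anova")
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)   # subtree with the lowest CV error
```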
Testing
## mae rmse
## 6.59e+07 1.15e+08
Training
## mae rmse
## 6.04e+07 1.03e+08
There are no significant changes in the results.
The data contain many variables, and we are predicting on a testing set using a model constructed on the training set. Therefore, Random Forest (RF) should perform better than a single decision tree, which prefers fewer variables and predicts best within the training sample.
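A minimal sketch of the fit, assuming the randomForest package is available (synthetic data, illustrative names):

```r
# Sketch: regression random forest with 350 trees, then variable importance.
library(randomForest)
set.seed(11)
n <- 300
df <- data.frame(budget = rnorm(n), vote = rnorm(n),
                 genres = factor(sample(c("Action", "Drama", "Comedy"), n,
                                        replace = TRUE)))
df$revenue <- with(df, 4 * budget + 3 * vote + rnorm(n))
rf <- randomForest(revenue ~ ., data = df, ntree = 350)
importance(rf)   # IncNodePurity, as in the table below
```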
##
## Call:
## randomForest(formula = revenue ~ ., data = train1_full, ntree = 350)
## Type of random forest: regression
## Number of trees: 350
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 9.46e+15
## % Var explained: 73.2
## IncNodePurity
## budget 1.90e+19
## popularity 1.43e+19
## runtime 4.15e+18
## score 3.39e+18
## vote 2.33e+19
## genres 5.58e+18
## company 2.60e+18
## season 1.49e+18
Vote, budget, and popularity have the highest importance. These are also the numerical variables selected in our best linear model, so the two models seem to agree.
There is a difference when accommodating the categorical variables: the linear model drops runtime but adds company and genres, while in the Random Forest runtime is more important than company. Here the two models seem to agree only on the genres variable.
As the number of trees increases, the mean squared error (MSE) decreases. Beyond a certain number of trees (around 100 in our case), the MSE shows no significant change.
Testing
## mae rmse
## 55398015 98520833
Training
## mae rmse
## 52851250 97252127
Random Forest has lower RMSE and MAE than the linear model.
The pseudo R-squared of the RF is slightly higher than the adjusted R-squared of the linear model.
Let's see whether we can improve our RF model by tuning it.
## mtry = 2 OOB error = 9.46e+15
## Searching left ...
## mtry = 1 OOB error = 1.06e+16
## -0.118 0.05
## Searching right ...
## mtry = 4 OOB error = 9.33e+15
## 0.014 0.05
The mtry value giving the lowest OOB error is 4. Now let's build an RF model with mtry = 4.
##
## Call:
## randomForest(formula = revenue ~ ., data = train1_full, mtry = 4, ntree = 350)
## Type of random forest: regression
## Number of trees: 350
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 9.17e+15
## % Var explained: 74
An improvement in pseudo R-squared: the % variance explained increases by about 1%.
Testing
## mae rmse
## 5.5e+07 9.7e+07
Training
## mae rmse
## 5.2e+07 9.6e+07
We can see the decreases in MAE and RMSE after tuning the forest.
| Model | Linear Model (budget + vote + popularity + company + genres) | Regression Tree | Random Forest |
|---|---|---|---|
| R-squared (adjusted / pseudo) | 0.721 | 0.711 | 0.741 |
| MAE - train | 5.833e+07 | 6.044e+07 | 5.2e+07 |
| MAE - test | 6.188e+07 | 6.595e+07 | 5.5e+07 |
| RMSE - train | 9.879e+07 | 1.029e+08 | 9.6e+07 |
| RMSE - test | 1.045e+08 | 1.15e+08 | 9.7e+07 |
We want to examine whether Principal Component Analysis (PCA) is effective for dimensionality reduction. In our data, we have 5 continuous predictors of revenue. We will perform principal component regression (PCR) directly and see how many components are sufficient for the model.
In this part, we will also justify the scaling of our data in the previous chapter by comparing two versions of the PCR model: centered data and non-centered data.
We will also include revenue when checking variance.
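In base R this comparison comes down to the `scale.` argument of `prcomp()` ("centered" in this report corresponds to standardized data, i.e. `scale. = TRUE`). A sketch on synthetic data with deliberately mismatched scales (names and magnitudes are illustrative):

```r
# Sketch: PCA with and without standardization. When one variable's scale
# dwarfs the others, PC1 of the unstandardized PCA absorbs nearly all variance.
set.seed(13)
raw <- data.frame(revenue = rnorm(200, 1e8, 2e7),  # hundreds of millions
                  score   = rnorm(200, 6, 1),      # single digits
                  vote    = rnorm(200, 900, 300))  # hundreds
pca_raw    <- prcomp(raw, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(raw, center = TRUE, scale. = TRUE)
summary(pca_raw)$importance["Proportion of Variance", "PC1"]     # ~1
summary(pca_scaled)$importance["Proportion of Variance", "PC1"]  # well below 1
```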
Non-centered data
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.89e+08 3.10e+07 926 23.9 20.1 0.699
## Proportion of Variance 9.74e-01 2.62e-02 0 0.0 0.0 0.000
## Cumulative Proportion 9.74e-01 1.00e+00 1 1.0 1.0 1.000
Centered data
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.768 1.108 0.894 0.6466 0.5045 0.4186
## Proportion of Variance 0.521 0.204 0.133 0.0697 0.0424 0.0292
## Cumulative Proportion 0.521 0.725 0.859 0.9284 0.9708 1.0000
We see significant differences in the variances and standard deviations of the components. Revenue and the predictors (except budget) are measured on different scales: revenue and budget are in the hundreds of millions, vastly larger than vote, score, and popularity.
Therefore, we recommend scaling the data before constructing the model. We will see the differences between the non-centered and centered data in the following models.
In the non-centered version, one component explains almost 100% of the variance; in the centered version, one component explains only about 50%, and we need 3 components to reach 80%. This is further evidence that we should scale the data: revenue and budget overwhelm the components in the non-centered version, causing PC1 to capture nearly the entire variance.
Non-centered version
Centered version
Both versions agree that 2 components give the optimal MSEP and R².
We can see the coefficients of different components.
After PC2, the coefficients change little compared with the difference between the PC1 and PC2 coefficients. Hence we can build the model on PC1 and PC2, since two components are sufficient to capture the variance.
We can make a comparison between the non-centered and centered versions.
Non-centered version
## Data: X dimension: 2176 5
## Y dimension: 2176 1
## Fit method: svdpc
## Number of components considered: 5
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## CV 1.88e+08 122426183 109504309 107691445 103280300 104020634
## adjCV 1.88e+08 122348421 109403985 107658736 103203256 103846041
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 49.00 71.71 87.31 95.63 100.00
## revenue 57.87 66.61 67.94 70.47 71.13
## Data: X dimension: 2176 5
## Y dimension: 2176 1
## Fit method: svdpc
## Number of components considered: 5
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## CV 1.88e+08 122115877 106538623 105948677 103446210 102956254
## adjCV 1.88e+08 122008326 106436756 105914036 103383690 102847338
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 48.50 71.69 87.41 95.61 100.00
## revenue 58.21 68.09 68.67 70.49 71.13
The scaled data explain more of the variance in revenue than the non-scaled data.
Let's try the PCR model on the testing data. We will use the centered version.
The cumulative explained variance increases sharply from PC1 to PC2; after that, the changes are modest. Two principal components capture the majority of the variance in the testing data, so our PCR model seems to carry over properly to the testing data.
We can use principal components as the predictors for a linear model. We will use two components to build the models with two versions: centered and non-centered.
Non-centered version
##
## Call:
## lm(formula = revenue ~ PC1 + PC2, data = movie_pcr.nc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20e+09 -4.08e+07 -5.95e+06 2.28e+07 1.76e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 121493614 2330120 52.1 <2e-16 ***
## PC1 -89813816 1463518 -61.4 <2e-16 ***
## PC2 51258272 2149532 23.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09e+08 on 2173 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.666
## F-statistic: 2.17e+03 on 2 and 2173 DF, p-value: <2e-16
## PC1 PC2
## 1 1
Centered version
##
## Call:
## lm(formula = revenue ~ PC1 + PC2, data = movie_pcr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11e+09 -4.03e+07 -4.95e+06 2.42e+07 1.73e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 120012282 2277573 52.7 <2e-16 ***
## PC1 -92108152 1462914 -63.0 <2e-16 ***
## PC2 54901946 2115458 25.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.06e+08 on 2173 degrees of freedom
## Multiple R-squared: 0.681, Adjusted R-squared: 0.681
## F-statistic: 2.32e+03 on 2 and 2173 DF, p-value: <2e-16
## PC1 PC2
## 1 1
All VIFs equal 1, a nice property of the PCA approach: the principal components are orthogonal, so there is no multicollinearity.
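The orthogonality is easy to verify directly: even when the raw columns are correlated, the component scores from `prcomp()` are uncorrelated, which is why every VIF is exactly 1. A sketch:

```r
# Sketch: PC scores are pairwise uncorrelated by construction.
set.seed(17)
X <- matrix(rnorm(300), ncol = 3)
X[, 2] <- X[, 1] + 0.5 * X[, 2]      # make the raw columns correlated
scores <- prcomp(X, scale. = TRUE)$x
round(cor(scores), 10)               # identity matrix
```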
The scaled version gives better results than the non-scaled version.
We can also use AIC and BIC to compare the non-scaled and scaled versions.
## [1] 86611
## [1] 86633
## [1] 86710
## [1] 86732
Both AIC and BIC agree that the scaled version is better.
Compared with the linear model on the numerical variables in the previous chapter, the adjusted R-squared drops slightly, from 71.1% to 68.1%. This is expected, since the goal of PCA is dimensionality reduction: instead of 5 variables (budget, popularity, vote, score, runtime), we need only 2 (PC1 and PC2). We can speed up computation without significantly hurting the model's performance.
In this chapter, we use several kinds of models to predict whether a movie earns a profit (revenue greater than budget). In our data, the column profitable records this outcome: a profitable movie is labeled 1, otherwise 0.
Genres
Company
Season
In this part we will construct the logit model on the whole dataset.
Budget and revenue alone are enough to determine profitability, since profit is revenue minus budget. Our pre-test with "bestglm" shows the same result.
## revenue budget popularity runtime score vote genres company season
## 1 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## 3 TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## 4 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
## 5 TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## Criterion
## 1 17.4
## 2 17.6
## 3 18.1
## 4 18.3
## 5 18.5
However, since the relationship between (revenue + budget) and profitable is too direct, we should not use both of them together.
In practice, we prefer budget over revenue as a predictor of profit. A film manager would want a prediction of a movie's profit before its release date. The information available at that point is the budget, runtime, genres, production company, popularity, vote and score (vote and score can be obtained from a preview screening; popularity can be gauged after advertisements, trailers and leaks). Revenue should therefore play the role of a response in the model, not a predictor.
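A subset-selection run of this kind (using the bestglm package; the data frame and variable names are assumptions) produces the ranking table above:

```r
library(bestglm)

# bestglm expects a data frame whose last column is the response,
# conventionally named y; all other columns are candidate predictors.
Xy <- data.frame(movies[, c("revenue", "budget", "popularity", "runtime",
                            "score", "vote", "genres", "company", "season")],
                 y = movies$profitable)
bestglm(Xy, family = binomial, IC = "BIC")
```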
Let's try the model with budget and the other predictors.
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = train3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.683 0.000 0.297 0.735 1.743
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.05e+00 5.35e-01 -3.84 0.00012 ***
## budget -1.75e-08 2.51e-09 -6.99 2.8e-12 ***
## popularity 1.61e-02 1.26e-02 1.28 0.20159
## runtime 1.66e-03 3.28e-03 0.50 0.61421
## score 2.27e-01 8.46e-02 2.68 0.00739 **
## vote 2.67e-03 4.49e-04 5.95 2.7e-09 ***
## genresAdventure 1.73e-01 2.51e-01 0.69 0.49015
## genresAnimation 2.70e-01 3.97e-01 0.68 0.49646
## genresComedy 4.31e-01 1.91e-01 2.26 0.02360 *
## genresCrime 7.35e-02 2.96e-01 0.25 0.80411
## genresDocumentary 4.66e-01 5.25e-01 0.89 0.37479
## genresDrama 7.25e-02 1.91e-01 0.38 0.70424
## genresFamily 2.31e-01 5.54e-01 0.42 0.67656
## genresFantasy 2.28e-01 4.32e-01 0.53 0.59765
## genresHistory 6.55e-01 7.13e-01 0.92 0.35873
## genresHorror 7.92e-01 3.16e-01 2.51 0.01212 *
## genresMusic 2.74e-01 7.09e-01 0.39 0.69916
## genresMystery -4.23e-01 6.33e-01 -0.67 0.50393
## genresRomance 8.50e-01 4.26e-01 2.00 0.04590 *
## genresScience Fiction 2.64e-01 5.05e-01 0.52 0.60148
## genresThriller -1.50e-01 3.29e-01 -0.46 0.64724
## genresWar -1.36e+00 8.29e-01 -1.64 0.10019
## genresWestern 2.27e+00 1.08e+00 2.10 0.03582 *
## companyParamount Pictures 9.81e-01 2.43e-01 4.03 5.5e-05 ***
## companySony Pictures 6.20e-01 2.22e-01 2.79 0.00524 **
## companyUniversal Pictures 8.52e-01 2.34e-01 3.64 0.00027 ***
## companyWalt Disney 8.60e-01 1.84e-01 4.67 3.0e-06 ***
## companyWarner Bros 7.69e-01 2.45e-01 3.13 0.00172 **
## seasonSummer 3.90e-01 1.74e-01 2.24 0.02505 *
## seasonFall -4.65e-02 1.62e-01 -0.29 0.77361
## seasonWinter 7.04e-02 1.69e-01 0.42 0.67734
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2409.2 on 2175 degrees of freedom
## Residual deviance: 1805.0 on 2145 degrees of freedom
## AIC: 1867
##
## Number of Fisher Scoring iterations: 8
We can test the effects of different genres/companies/seasons on the prediction to see whether they are significant.
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 23.8, df = 17, P(> X2) = 0.12
It seems that different genres do not have significant effects on the response in our logit model.
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 46.2, df = 5, P(> X2) = 8.4e-09
The effects of different companies are significant.
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 8.0, df = 3, P(> X2) = 0.045
The effects of different seasons are significant, but not as clear as the effects of different companies.
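The three joint tests above can be reproduced with `aod::wald.test`, using the coefficient positions from the summary table (the model name `prof_glm` and the term indices are assumptions based on the coefficient order shown above):

```r
library(aod)

# Coefficient positions in the fitted logit model: genre dummies are
# terms 7-23 (17 df), company dummies 24-28 (5 df), season dummies 29-31 (3 df).
wald.test(b = coef(prof_glm), Sigma = vcov(prof_glm), Terms = 7:23)   # genres
wald.test(b = coef(prof_glm), Sigma = vcov(prof_glm), Terms = 24:28)  # company
wald.test(b = coef(prof_glm), Sigma = vcov(prof_glm), Terms = 29:31)  # season
```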
## budget popularity runtime score vote genres company season Criterion
## 1 TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE 1855
## 2 TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE 1856
## 3 TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE 1857
## 4 TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE 1858
## 5 TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE 1858
We now build the best model selected above.
##
## Call:
## glm(formula = y ~ budget + score + vote + company + season, family = "binomial",
## data = train3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.733 0.000 0.306 0.763 1.770
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.44e+00 4.57e-01 -3.15 0.00163 **
## budget -1.83e-08 2.30e-09 -7.95 1.9e-15 ***
## score 2.10e-01 7.18e-02 2.93 0.00340 **
## vote 3.18e-03 2.37e-04 13.43 < 2e-16 ***
## companyParamount Pictures 9.64e-01 2.40e-01 4.02 5.7e-05 ***
## companySony Pictures 5.48e-01 2.19e-01 2.50 0.01237 *
## companyUniversal Pictures 8.42e-01 2.30e-01 3.67 0.00024 ***
## companyWalt Disney 8.75e-01 1.80e-01 4.86 1.2e-06 ***
## companyWarner Bros 7.88e-01 2.41e-01 3.27 0.00109 **
## seasonSummer 3.84e-01 1.72e-01 2.24 0.02513 *
## seasonFall -6.93e-02 1.59e-01 -0.43 0.66375
## seasonWinter 5.37e-02 1.66e-01 0.32 0.74708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2409.2 on 2175 degrees of freedom
## Residual deviance: 1832.9 on 2164 degrees of freedom
## AIC: 1857
##
## Number of Fisher Scoring iterations: 7
We can validate the model on the testing set with the following methods:
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: test3$y, prof_glm_pred
## X-squared = 1049, df = 8, p-value <2e-16
The p-value is extremely low. Note that in the Hosmer and Lemeshow test the null hypothesis is that the model fits well, so a low p-value actually signals lack of fit (a calibration problem) rather than a good fit.
## Area under the curve: 0.849
The area under the curve is above 0.80, which indicates good discriminative ability, even if the Hosmer and Lemeshow test raises concerns about calibration.
## llh llhNull G2 McFadden r2ML r2CU
## -916.473 -1204.608 576.271 0.239 0.233 0.348
McFadden's pseudo R-squared is 0.239, i.e. about 23.9% of the null deviance is explained by the predictors in our model. Not great, but acceptable.
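The three validation checks above can be run along these lines (the model object, test-set names and package choices are assumptions):

```r
library(ResourceSelection)  # hoslem.test
library(pROC)               # roc, auc
library(pscl)               # pR2

# Predicted probabilities on the held-out test set.
prof_glm_pred <- predict(prof_glm, newdata = test3, type = "response")

hoslem.test(test3$y, prof_glm_pred, g = 10)  # df = g - 2 = 8
auc(roc(test3$y, prof_glm_pred))             # ROC area under the curve
pR2(prof_glm)                                # McFadden's pseudo R-squared etc.
```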
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 248 416
## 1 12 373
##
## Accuracy : 0.592
## 95% CI : (0.562, 0.622)
## No Information Rate : 0.752
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.28
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.954
## Specificity : 0.473
## Pos Pred Value : 0.373
## Neg Pred Value : 0.969
## Prevalence : 0.248
## Detection Rate : 0.236
## Detection Prevalence : 0.633
## Balanced Accuracy : 0.713
##
## 'Positive' Class : 0
##
## threshold accuracy
## 1 0.5 81.9
## 2 0.6 80.7
## 3 0.7 77.4
## 4 0.8 69.0
## 5 0.9 59.2
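The accuracy-by-threshold table above can be generated with a sweep like this (the predicted-probability vector and test labels are assumed names):

```r
# For each cutoff, classify a movie as profitable when its predicted
# probability exceeds the threshold, then compute test accuracy in percent.
thresholds <- seq(0.5, 0.9, by = 0.1)
accuracy <- sapply(thresholds, function(t) {
  pred <- ifelse(prof_glm_pred > t, 1, 0)
  round(100 * mean(pred == test3$y), 1)
})
data.frame(threshold = thresholds, accuracy = accuracy)
```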
We can visualize our tree.
We can optimize our tree by pruning (in complex models, pruning helps to reduce overfitting).
##
## Classification tree:
## rpart(formula = y ~ ., data = train3, method = "class")
##
## Variables actually used in tree construction:
## [1] budget company genres vote
##
## Root node error: 527/2176 = 0.2
##
## n= 2176
##
## CP nsplit rel error xerror xstd
## 1 0.08 0 1.0 1.0 0.04
## 2 0.03 2 0.8 0.9 0.04
## 3 0.02 4 0.8 0.9 0.04
## 4 0.01 7 0.7 0.8 0.04
## 5 0.01 8 0.7 0.8 0.04
The cross-validated error (xerror) is lowest at row 4 of the CP table (CP = 0.01, number of splits = 7).
Let’s see our tree after pruning.
The last split was pruned.
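Pruning at the CP with the lowest cross-validated error can be sketched as follows (the tree object name `prof_tree` is an assumption):

```r
library(rpart)

# Pick the complexity parameter that minimizes the cross-validated
# error (xerror) in the CP table, then prune the tree at that value.
cp_best <- prof_tree$cptable[which.min(prof_tree$cptable[, "xerror"]), "CP"]
prof_tree_pruned <- prune(prof_tree, cp = cp_best)
```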
## predicted
## actual 0 1
## 0 88 172
## 1 35 754
## [1] 0.803
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 88 35
## 1 172 754
##
## Accuracy : 0.803
## 95% CI : (0.777, 0.826)
## No Information Rate : 0.752
## P-Value [Acc > NIR] : 6.09e-05
##
## Kappa : 0.357
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.3385
## Specificity : 0.9556
## Pos Pred Value : 0.7154
## Neg Pred Value : 0.8143
## Prevalence : 0.2479
## Detection Rate : 0.0839
## Detection Prevalence : 0.1173
## Balanced Accuracy : 0.6471
##
## 'Positive' Class : 0
##
## Area under the curve: 0.764
The area under the curve is 0.764 (below 0.8). The classification tree does not seem to perform as well as logistic regression in this case.
We will use a KNN model in this part.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1049
##
##
## | season_knn
## test2$season | Spring | Summer | Fall | Winter | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|
## Spring | 40 | 63 | 64 | 47 | 214 |
## | 0.187 | 0.294 | 0.299 | 0.220 | 0.204 |
## | 0.189 | 0.243 | 0.190 | 0.195 | |
## | 0.038 | 0.060 | 0.061 | 0.045 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## Summer | 75 | 84 | 63 | 59 | 281 |
## | 0.267 | 0.299 | 0.224 | 0.210 | 0.268 |
## | 0.354 | 0.324 | 0.187 | 0.245 | |
## | 0.071 | 0.080 | 0.060 | 0.056 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## Fall | 49 | 72 | 125 | 81 | 327 |
## | 0.150 | 0.220 | 0.382 | 0.248 | 0.312 |
## | 0.231 | 0.278 | 0.371 | 0.336 | |
## | 0.047 | 0.069 | 0.119 | 0.077 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## Winter | 48 | 40 | 85 | 54 | 227 |
## | 0.211 | 0.176 | 0.374 | 0.238 | 0.216 |
## | 0.226 | 0.154 | 0.252 | 0.224 | |
## | 0.046 | 0.038 | 0.081 | 0.051 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 212 | 259 | 337 | 241 | 1049 |
## | 0.202 | 0.247 | 0.321 | 0.230 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
##
##
k = 35 gives the best accuracy.
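A sweep over candidate k values of this kind (the scaled predictor matrices and object names are assumptions) selects the best k:

```r
library(class)

# Try odd k values and keep the one with the highest test accuracy
# when predicting the season from the scaled numeric predictors.
ks <- seq(1, 51, by = 2)
acc <- sapply(ks, function(k) {
  pred <- knn(train = train2_scaled, test = test2_scaled,
              cl = train2$season, k = k)
  mean(pred == test2$season)
})
ks[which.max(acc)]  # best-performing k
```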
There are many genres, so we won't show the cross table.
k = 27 gives the best accuracy.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1049
##
##
## | company_knn
## test2$company | Others | Paramount Pictures | Sony Pictures | Universal Pictures | Walt Disney | Warner Bros | Row Total |
## -------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
## Others | 447 | 10 | 10 | 17 | 48 | 2 | 534 |
## | 0.837 | 0.019 | 0.019 | 0.032 | 0.090 | 0.004 | 0.509 |
## | 0.548 | 0.370 | 0.312 | 0.395 | 0.381 | 0.400 | |
## | 0.426 | 0.010 | 0.010 | 0.016 | 0.046 | 0.002 | |
## -------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
## Paramount Pictures | 67 | 2 | 1 | 2 | 14 | 0 | 86 |
## | 0.779 | 0.023 | 0.012 | 0.023 | 0.163 | 0.000 | 0.082 |
## | 0.082 | 0.074 | 0.031 | 0.047 | 0.111 | 0.000 | |
## | 0.064 | 0.002 | 0.001 | 0.002 | 0.013 | 0.000 | |
## -------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
## Sony Pictures | 62 | 5 | 3 | 9 | 9 | 0 | 88 |
## | 0.705 | 0.057 | 0.034 | 0.102 | 0.102 | 0.000 | 0.084 |
## | 0.076 | 0.185 | 0.094 | 0.209 | 0.071 | 0.000 | |
## | 0.059 | 0.005 | 0.003 | 0.009 | 0.009 | 0.000 | |
## -------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
## Universal Pictures | 81 | 3 | 9 | 7 | 17 | 0 | 117 |
## | 0.692 | 0.026 | 0.077 | 0.060 | 0.145 | 0.000 | 0.112 |
## | 0.099 | 0.111 | 0.281 | 0.163 | 0.135 | 0.000 | |
## | 0.077 | 0.003 | 0.009 | 0.007 | 0.016 | 0.000 | |
## -------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
## Walt Disney | 109 | 7 | 7 | 7 | 27 | 2 | 159 |
## | 0.686 | 0.044 | 0.044 | 0.044 | 0.170 | 0.013 | 0.152 |
## | 0.134 | 0.259 | 0.219 | 0.163 | 0.214 | 0.400 | |
## | 0.104 | 0.007 | 0.007 | 0.007 | 0.026 | 0.002 | |
## -------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
## Warner Bros | 50 | 0 | 2 | 1 | 11 | 1 | 65 |
## | 0.769 | 0.000 | 0.031 | 0.015 | 0.169 | 0.015 | 0.062 |
## | 0.061 | 0.000 | 0.062 | 0.023 | 0.087 | 0.200 | |
## | 0.048 | 0.000 | 0.002 | 0.001 | 0.010 | 0.001 | |
## -------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
## Column Total | 816 | 27 | 32 | 43 | 126 | 5 | 1049 |
## | 0.778 | 0.026 | 0.031 | 0.041 | 0.120 | 0.005 | |
## -------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
##
##
k = 23 gives the best accuracy.
We use a time series model in this part.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.55e+09 -6.35e+08 2.92e+08 0.00e+00 9.27e+08 9.67e+08
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1.05e+09 -4.67e+08 -8.91e+07 -3.50e+06 4.29e+08 1.99e+09 4
Prediction of future values
We evaluate the model's predictions on the testing data.
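A fit of this kind (the series names and the use of the forecast package are assumptions about the workflow) would produce test-set predictions and error summaries like those above:

```r
library(forecast)

# Fit an ARIMA model on the training series and forecast over the
# test horizon; accuracy() reports error measures on both sets.
fit <- auto.arima(revenue_ts_train)
fc  <- forecast(fit, h = length(revenue_ts_test))
accuracy(fc, revenue_ts_test)
autoplot(fc)  # forecast with prediction intervals
```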
Better visualization with highcharter
We can also produce nicer plots with highcharter.